
Loom test for deadlock observed in tokio's test suite #6876

Closed
wants to merge 1 commit

Conversation

@jofas (Contributor) commented Sep 28, 2024

This PR adds a Loom test for the deadlock observed in #6847.

When I run this test locally on my machine with

LOOM_MAX_PREEMPTIONS=1 LOOM_MAX_BRANCHES=10000 RUSTFLAGS="--cfg loom -C debug_assertions" \
    cargo test --lib --release --features full pool_deadlock_on_blocked_task \
    -- --test-threads=1 --nocapture

I get the following error:

running 1 test
test runtime::tests::loom_multi_thread::group_d::pool_deadlock_on_blocked_task ... thread 'runtime::tests::loom_multi_thread::group_d::pool_deadlock_on_blocked_task' panicked at /home/masterusr/.cargo/registry/src/index.crates.io-6f17d22bba15001f/loom-0.7.2/src/rt/execution.rs:216:13:
deadlock; threads = [(Id(0), Blocked(Location(None))), (Id(1), Blocked(Location(None))), (Id(2), Blocked(Location(None)))]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'runtime::tests::loom_multi_thread::group_d::pool_deadlock_on_blocked_task' panicked at /home/masterusr/.cargo/registry/src/index.crates.io-6f17d22bba15001f/loom-0.7.2/src/rt/thread.rs:276:39:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:     0x56425478aad5 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
   1:     0x5642547b34db - core::fmt::write::hc090a2ffd6b28c4a
   2:     0x56425478834f - std::io::Write::write_fmt::h8898bac6ff039a23
   3:     0x56425478a8ae - std::sys_common::backtrace::print::ha96650907276675e
   4:     0x56425478c229 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
   5:     0x56425478bf6d - std::panicking::default_hook::h207342be97478370
   6:     0x56425478c7f6 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
   7:     0x56425478c56b - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
   8:     0x56425478af99 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
   9:     0x56425478c2d7 - rust_begin_unwind
  10:     0x5642543ae4e3 - core::panicking::panic_fmt::hdc63834ffaaefae5
  11:     0x5642543ae58c - core::panicking::panic::h75b3c9209f97d725
  12:     0x5642543ae489 - core::option::unwrap_failed::h4b4353bf890a85df
  13:     0x5642545f8aff - loom::rt::object::Ref<T>::set_action::hd5b09cd3dece6232
  14:     0x56425461154c - scoped_tls::ScopedKey<T>::with::hd6ef3a1bee7ec98b
  15:     0x5642545e6bce - loom::rt::atomic::Atomic<T>::store::h5d2d323740f21a8e
  16:     0x56425459365c - tokio::runtime::scheduler::multi_thread::park::Parker::park::h4ae71e780fadb3a8
  17:     0x56425450599e - tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::hc7658d589be3126b
  18:     0x56425450476c - tokio::runtime::scheduler::multi_thread::worker::Context::run::h57894510918d1e2b
  19:     0x56425454aafd - tokio::runtime::context::scoped::Scoped<T>::set::h8d9c484a2b1a5a11
  20:     0x5642544bdab5 - loom::thread::LocalKey<T>::try_with::h3f0132ee1c91aba6
  21:     0x564254524aeb - tokio::runtime::context::runtime::enter_runtime::h3ce4255900a8aedd
  22:     0x564254503b15 - tokio::runtime::scheduler::multi_thread::worker::run::h1f5b7e23b8e40277
  23:     0x5642544a9567 - loom::cell::unsafe_cell::UnsafeCell<T>::with_mut::h476d68d1ebb373c5
  24:     0x5642544eac54 - tokio::runtime::task::core::Core<T,S>::poll::h7aabc50663325fdb
  25:     0x56425442454e - tokio::runtime::task::harness::Harness<T,S>::poll::h56ead0a7ce702948
  26:     0x5642545a69be - tokio::runtime::blocking::pool::Inner::run::hfe127e926858c8af
  27:     0x564254562e2e - core::ops::function::FnOnce::call_once{{vtable.shim}}::hfbc82a17d67087f1
  28:     0x564254601527 - generator::stack::StackBox<F>::call_once::heb5e6a7940558221
  29:     0x56425475ed0b - std::panicking::try::h92c7df7a6cc6dd01
  30:     0x56425475f0c8 - generator::detail::gen::gen_init_impl::habe2c082c5ebb920
  31:     0x56425475ef79 - generator::detail::asm::gen_init::h5730af05b288df0e
  32:                0x0 - <unknown>
thread 'runtime::tests::loom_multi_thread::group_d::pool_deadlock_on_blocked_task' panicked at library/core/src/panicking.rs:228:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully: `/home/masterusr/src/tokio/target/release/deps/tokio-367a8771b781d33e pool_deadlock_on_blocked_task --test-threads=1 --nocapture` (signal: 6, SIGABRT: process abort signal)

which I believe indicates that the test successfully reproduces the deadlock.

I used oneshot channels instead of the barriers used in the flaky test where the deadlock was first observed, because Loom currently does not support barriers.
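
As a rough illustration of that substitution, here is a minimal sketch (not the PR's actual test code; it assumes an ordinary running tokio runtime rather than a loom model) in which two oneshot channels stand in for a two-party barrier:

```rust
use tokio::sync::oneshot;

// Hypothetical sketch: two oneshot channels emulate a two-party barrier,
// which loom cannot model directly. Requires a running tokio runtime.
async fn two_party_rendezvous() {
    let (tx_a, rx_a) = oneshot::channel::<()>();
    let (tx_b, rx_b) = oneshot::channel::<()>();

    let a = tokio::spawn(async move {
        tx_a.send(()).unwrap(); // "A has arrived"
        rx_b.await.unwrap();    // wait for B
    });
    let b = tokio::spawn(async move {
        tx_b.send(()).unwrap(); // "B has arrived"
        rx_a.await.unwrap();    // wait for A
    });

    a.await.unwrap();
    b.await.unwrap();
}
```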

I'm opening this up as a Draft PR because I'm looking for early feedback on whether I'm on the right track here or if I have misunderstood the assignment.

@Darksonn (Contributor)
Yep, that looks like it catches the bug.

@jofas marked this pull request as ready for review October 1, 2024 06:46
@mox692 (Member) commented Oct 7, 2024

Given Carl's suggestion here, I'm wondering if the loom test will succeed or not with one worker thread.

@Darksonn (Contributor) commented Oct 7, 2024

I think the bug requires at least one worker thread. I think the solution is to make sure that, when the runtime is started, at least one of the worker threads is in a searching state.

@Darksonn added the A-tokio (Area: The main tokio crate) and M-runtime (Module: tokio/runtime) labels on Oct 7, 2024
@jofas (Contributor, Author) commented Oct 7, 2024

> if the loom test will succeed or not with one worker thread

If we had only one worker thread and it deterministically always ran the tasks in the order in which they are spawned, I'd expect the test to always deadlock. Even if it ran the tasks non-deterministically, I believe a deadlock would happen far more often (still every time the first task runs before the second) than it does with two or more workers.

The deadlock, as far as I can see, happens when the second worker is parked while the second task (which, when executed, would resolve the deadlock) is still sitting in some queue elsewhere, and the first worker, blocked by the first task, is unable to notify the second worker to unpark again.
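
To make that interleaving concrete, here is a hypothetical sketch of the shape of the scenario described above (illustrative only, not the PR's loom test); the first task blocks its worker thread on a synchronous channel until the second task runs:

```rust
use std::sync::mpsc;

// Hypothetical illustration of the discussed interleaving; assumes a
// running multi-threaded tokio runtime.
async fn deadlock_shape() {
    let (tx, rx) = mpsc::channel::<()>();

    // First task: blocks its worker thread synchronously until the
    // second task sends the signal.
    let blocker = tokio::spawn(async move {
        rx.recv().unwrap();
    });

    // Second task: would unblock the first one. If it sits in a queue
    // while the only awake worker is blocked above and the other worker
    // is parked without being notified, it never gets polled.
    let unblocker = tokio::spawn(async move {
        tx.send(()).unwrap();
    });

    blocker.await.unwrap();
    unblocker.await.unwrap();
}
```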

@mox692 (Member) commented Oct 13, 2024

Looking into the deadlock scenario more closely (using loom's checkpoint feature), I suspect that the following is happening:

@carllerche (Member)
> Looking into the deadlock scenario more closely (using loom's checkpoint feature), I suspect that the following is happening:

Thanks for looking into this. In theory, when Worker A finds a task, it transitions out of searching, and if it is the last searching worker, it notifies a sleeping worker to wake up and try searching; Worker B would then wake and process the task. However, that logic is (intentionally) racy. Here, before parking, you can see a safeguard that handles this case. When a worker transitions out of searching because it finds work, the intent is that once the task is done being polled, the worker will find the rest of the work, thus mitigating the race. In this loom test, the task blocks forever, so the runtime deadlocks.
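
As a simplified sketch of the rule described in the previous paragraph (the names below are illustrative, not tokio's actual scheduler internals): when a searching worker finds a task, it decrements a shared count of searching workers, and if it was the last one, a parked worker should be woken so queued work is not stranded while the found task is polled.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative only; not tokio's real code.
struct Idle {
    // Number of workers currently searching for work.
    num_searching: AtomicUsize,
}

impl Idle {
    /// Called when a searching worker finds a task and stops searching.
    /// Returns true if the caller was the last searching worker, in which
    /// case the caller should notify (unpark) a sleeping worker.
    fn transition_from_searching(&self) -> bool {
        self.num_searching.fetch_sub(1, Ordering::SeqCst) == 1
    }
}
```

If the task the worker then polls blocks forever, the "finish polling, then pick up the remaining work" part of the mitigation never happens, which is the deadlock reproduced by the loom test.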

Unfortunately, there isn't much we can do in this specific case. If we want to make the runtime "bulletproof," the best strategy would be to add some logic to detect blocked tasks, report a warning, and poke the runtime to get unstuck.

We may want to rewrite the original flaky test to avoid blocking the runtime.

Thanks for looking into it, though. I hope that you found it a good learning experience. I'm also happy to answer any further runtime-related questions.

@mox692 (Member) commented Oct 16, 2024

> In this loom test, the task blocks forever, so the runtime deadlocks.

Yes, I suppose this code is just for testing purposes; with a well-behaved async task that does not block for a long time, the scenario would be different.

> We may want to rewrite the original flaky test to avoid blocking the runtime.

I agree. @jofas Are you still interested in this issue? (If not, I can do that test fix)

@jofas (Contributor, Author) commented Oct 16, 2024

> Are you still interested in this issue? (If not, I can do that test fix)

I'd be interested in fixing the test, but I'm not exactly sure how we can determine the injection queue depth without blocking the two workers so that they can't consume any tasks from it. My first idea was to keep the workers busy by continuously filling their LIFO slots with tasks so that they don't start consuming the injection queue, but I don't know whether that would work, and even then it sounds a bit brittle to me. My second idea was to have only a single worker, block it like we already do, and then fill up the injection queue, but I'm not sure that would be considered a good enough test for the multithreaded runtime. How would you fix the test, if you don't mind me asking?

@Darksonn (Contributor)
I think it's okay to say that if the test doesn't succeed within some timeout, you just restart the test. Then retry up to, let's say, 10 times. (You probably have to get the blocking tasks to exit to restart the test.)
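
A hypothetical sketch of such a retry harness (names and structure are assumptions, not tokio's actual test code): the closure runs one attempt of the test body and reports whether it succeeded within the timeout, and the harness retries up to ten times.

```rust
use std::time::Duration;

// Illustrative retry harness; `attempt` runs one iteration of the test
// body and returns true if the expected condition was observed before
// the timeout elapsed. The body is responsible for letting its blocking
// tasks exit so the runtime can shut down between attempts.
fn run_with_retries(mut attempt: impl FnMut(Duration) -> bool) {
    const MAX_ATTEMPTS: usize = 10;
    let timeout = Duration::from_secs(5);

    for i in 1..=MAX_ATTEMPTS {
        if attempt(timeout) {
            return;
        }
        eprintln!("attempt {i} timed out, retrying");
    }
    panic!("test did not succeed within {MAX_ATTEMPTS} attempts");
}
```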

@Darksonn (Contributor)
I'm guessing this should be closed now that we merged #6916?

@jofas (Contributor, Author) commented Oct 21, 2024

I agree. Should this discussion be summarized somewhere more prominently, in case people actually trigger the deadlock in their codebase?

@Darksonn (Contributor)
We already have open bugs about tolerating blocking tasks.

Labels: A-tokio (Area: The main tokio crate), M-runtime (Module: tokio/runtime)
4 participants